"visit with us" - travel package purchase prediction - ensemble techniques

Context:

Objective:

To harness the available data on existing and potential customers to make marketing expenditure more efficient and to predict which customers are more likely to purchase the newly introduced travel package.

Key questions to be answered:

  1. What are the key variables in identifying potential customers for the travel packages?
  2. What are the different characteristics of the existing customers?
  3. What is the most important metric for the model and possible improvements using tuning?

Data Dictionary

Customer details:

  1. CustomerID: Unique customer ID
  2. ProdTaken: Whether the customer has purchased a package or not (0: No, 1: Yes)
  3. Age: Age of customer
  4. TypeofContact: How customer was contacted (Company Invited or Self Inquiry)
  5. CityTier: City tier depends on the development of a city, population, facilities, and living standards. The categories are ordered i.e. Tier 1 > Tier 2 > Tier 3
  6. Occupation: Occupation of customer
  7. Gender: Gender of customer
  8. NumberOfPersonVisiting: Total number of persons planning to take the trip with the customer
  9. PreferredPropertyStar: Preferred hotel property rating by customer
  10. MaritalStatus: Marital status of customer
  11. NumberOfTrips: Average number of trips in a year by customer
  12. Passport: The customer has a passport or not (0: No, 1: Yes)
  13. OwnCar: Whether the customers own a car or not (0: No, 1: Yes)
  14. NumberOfChildrenVisiting: Total number of children with age less than 5 planning to take the trip with the customer
  15. Designation: Designation of the customer in the current organization
  16. MonthlyIncome: Gross monthly income of the customer

Customer interaction data:

  1. PitchSatisfactionScore: Sales pitch satisfaction score
  2. ProductPitched: Product pitched by the salesperson
  3. NumberOfFollowups: Total number of follow-ups done by the salesperson after the sales pitch
  4. DurationOfPitch: Duration of the pitch by a salesperson to the customer

Import required libraries

Define all required functions

Load the dataset

Understand the data

Check the shape of data

Check the dataset information

Observations:

  1. TypeofContact, Occupation, Gender, ProductPitched, MaritalStatus, and Designation are object type columns but contain categorical information. We will convert these columns to Category type.
  2. PreferredPropertyStar, NumberOfChildrenVisiting, NumberOfPersonVisiting, PitchSatisfactionScore, NumberOfFollowups, OwnCar, Passport, CityTier, and ProdTaken are numerical columns but contain categorical information. We will convert these columns to Category type.
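The conversion described above can be sketched as follows; the miniature DataFrame and column lists here are hypothetical stand-ins for the full dataset:

```python
import pandas as pd

# Hypothetical sample standing in for the full dataset
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "CityTier": [1, 3, 1],
    "ProdTaken": [0, 1, 0],
    "MonthlyIncome": [20500.0, 23400.0, 18200.0],
})

# Object columns and numeric columns that actually encode categories
object_cols = ["Gender"]
numeric_cat_cols = ["CityTier", "ProdTaken"]

for col in object_cols + numeric_cat_cols:
    df[col] = df[col].astype("category")

print(df.dtypes)
```

Genuinely numeric columns such as MonthlyIncome are left untouched.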

Check the sample data

Data pre-processing

Convert the Categorical columns to Category Datatype

Observations:

The datatypes of the categorical columns are fixed now

Check the missing values

Observations:

  1. There are 7 columns with null values.
  2. Age and MonthlyIncome are usually linked to Designation, and since Designation has no missing values, we can use it as a basis to impute the missing values in Age and MonthlyIncome.
  3. NumberOfChildrenVisiting, NumberOfTrips, PreferredPropertyStar, NumberOfFollowups, and DurationOfPitch can be populated with the median values of each column.
  4. For TypeofContact, we will impute values based on the available data in the column.
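The imputation plan above can be sketched as below; the toy DataFrame is hypothetical, with column names taken from the data dictionary:

```python
import numpy as np
import pandas as pd

# Hypothetical sample; column names follow the data dictionary
df = pd.DataFrame({
    "Designation": ["Executive", "Executive", "Manager", "Manager"],
    "Age": [28.0, np.nan, 41.0, 43.0],
    "MonthlyIncome": [20000.0, 21000.0, np.nan, 26000.0],
    "NumberOfTrips": [2.0, np.nan, 3.0, 5.0],
    "TypeofContact": ["Self Enquiry", None, "Company Invited", "Self Enquiry"],
})

# Age and MonthlyIncome: fill with the median of the customer's Designation group
for col in ["Age", "MonthlyIncome"]:
    df[col] = df.groupby("Designation")[col].transform(lambda s: s.fillna(s.median()))

# Other numeric columns: overall column median
df["NumberOfTrips"] = df["NumberOfTrips"].fillna(df["NumberOfTrips"].median())

# TypeofContact: most frequent value (mode) in the column
df["TypeofContact"] = df["TypeofContact"].fillna(df["TypeofContact"].mode()[0])

print(df.isnull().sum().sum())  # remaining null count
```

Group-wise medians keep the Age and MonthlyIncome fills consistent with each customer's seniority instead of flattening everyone to one global value.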

Treat Age and MonthlyIncome for missing values

Treat other numerical columns for missing values

Check the data for category columns

Observations:

  1. The Gender column contains the erroneous value Fe Male. We will treat this as a data entry issue and replace it with Female.
  2. Self Inquiry is the most frequent value in the TypeofContact feature.
  3. 3.0 is the most frequent property rating.
  4. 1.0 is the most frequent value in the NumberOfChildrenVisiting column.
  5. Hence we will replace the missing values in these columns with the modes above.

Treat the other columns for missing values

Treat the error in Gender column
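A minimal sketch of the fix, using a hypothetical sample that contains the erroneous label:

```python
import pandas as pd

# Hypothetical sample containing the "Fe Male" data entry error
df = pd.DataFrame({"Gender": ["Male", "Fe Male", "Female", "Fe Male"]})

# Replace the erroneous label with the intended one
df["Gender"] = df["Gender"].replace("Fe Male", "Female")

print(df["Gender"].value_counts())
```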

Observations:

The error in data is resolved now

Verify the missing value treatment

Observations:

All the null values are treated now

Data verification

Summary of numerical columns

Summary of categorical columns

Observations:

  1. Self Inquiry is the most preferred Type of Contact.
  2. ProdTaken: There is heavy imbalance in this column; at least 80% of customers did not purchase any product.
  3. CityTier: Most customers are from Tier 1.
  4. Occupation: Most customers are salaried.
  5. Gender: Male customers slightly outnumber Female customers.
  6. NumberOfPersonVisiting: Most customers plan to take at least 3 additional persons with them on the trip.
  7. ProductPitched: Basic is the most popular product.
  8. MaritalStatus: Most customers are married.
  9. Passport: Most customers don't have a passport.
  10. PitchSatisfactionScore: Most customers rated 3.0.
  11. OwnCar: Most customers own a car.
  12. NumberOfChildrenVisiting: Most customers plan to take at least 1 child under five with them on the trip.
  13. Designation: Most customers hold the Executive designation.

Univariate Analysis

CustomerID

Observations:

  1. CustomerID shows uniform data, as it is a sequential tracking number.
  2. CustomerID only identifies the customer record.
  3. We will not use this column in model building.

ProdTaken - Target Variable

Observations:

Only 18.8% of customers purchased any travel package. The plot shows heavy imbalance in the dataset.

Age

Observations:

The Age variable is almost normally distributed with no outliers. Most customers are in the 30-45 year age bracket.

TypeofContact

Observations:

Self Inquiry is the most preferred contact method among customers at 71%.

CityTier

Observations:

65.3% of customers are from Tier 1 cities, and Tier 3 cities come second at 30.7%.

DurationOfPitch

Observations:

  1. DurationOfPitch is slightly right-skewed.
  2. Most customers' pitch duration was under 20 minutes.
  3. We also see a few outliers at 40 minutes and at 120+ minutes.

Occupation

Observations:

  1. 48.4% of customers are Salaried.
  2. Customers with a Small Business are the next largest group at 42.6%.
  3. There are very few Freelancers.

Gender

Observations:

Male customers (59.7%) outnumber Female customers (40.3%).

NumberOfPersonVisiting

Observations:

  1. 49.1% of customers plan to take at least 3 persons with them during the trip.
  2. Around 29% of customers want to take 2 people.
  3. 21% of customers want to take 4 additional persons with them during their travel.

NumberOfFollowups

Observations:

The most common number of follow-ups is 4.0, followed by 3.0.

ProductPitched

Observations:

  1. Basic(37.7%) and Deluxe(35.4%) are the most popular travel packages.
  2. The next slightly popular one is the Standard Travel package at 15.2%.

PreferredPropertyStar

Observations:

61.8% of customers prefer a three-star hotel rating, compared to four-star (18.7%) and five-star (19.6%) hotels.

MaritalStatus

Observations:

  1. Married customers form the bulk of the data at 47.9%.
  2. Divorced (19.4%) and Single (18.7%) customers come in second and third place.
  3. Unmarried customers with partners form 14% of the data.

NumberOfTrips

Observations:

  1. NumberOfTrips is slightly right-skewed, and the majority of customers take at least 2 trips per year.
  2. We also see a few outliers at the higher end.

Passport

Observations:

Only 29.1% of customers have a passport.

PitchSatisfactionScore

Observations:

  1. 30.2% of customers rated the sales pitch 3, the most common score.
  2. 18.7% of customers rated it 4.
  3. 19.8% rated it 5.
  4. However, 19.3% rated the sales pitch 1.
  5. This shows a need for improvement in this area.

OwnCar

Observations:

62% of customers own a car.

NumberOfChildrenVisiting

Observations:

Around 43.9% of customers plan to have at least one child under age five accompany them on their travels.

Designation

Observations:

Executive (37.7%) and Manager (35.4%) are the most common designations of customers in the dataset.

MonthlyIncome

Observation:

  1. MonthlyIncome is right-skewed.
  2. The majority of customers are in the $20K-$30K income bracket.
  3. We also see a couple of outliers at the low end and at the highest end.
  4. There are several outliers above roughly the $35K income level.

Bivariate Analysis

Comparision of Numerical Variables with ProdTaken to understand the relation

Observations:

  1. The mean Age of customers who purchased a product is slightly lower than that of those who didn't.
  2. The mean DurationOfPitch is almost equal for both classes of ProdTaken. There are many outliers in class 0, suggesting that longer pitch durations don't lead to product purchases.
  3. Customers who purchased packages averaged at least four follow-ups, more than customers who didn't.
  4. The averages of NumberOfTrips and MonthlyIncome are almost equal for both classes of ProdTaken.
  5. The MonthlyIncome variable has several outliers at the higher end for both ProdTaken classes and very few at the low end of class 0.
  6. The Age variable doesn't have any outliers.
  7. The CustomerID column is not relevant for analysis; we will exclude it from model building.

TypeofContact VS ProdTaken

Observations:

A higher proportion of customers contacted via "Company Invited" bought the Travel Package compared to "Self Inquiry" customers.

CityTier VS ProdTaken

Observations:

A higher proportion of customers from Tier 2 and Tier 3 cities purchased Travel Packages.

Occupation VS ProdTaken

Observations:

  1. All customers who are Freelancers by occupation bought travel packages; however, the sample size is only two.
  2. Of the 434 customers owning a Large Business, almost 30% bought travel packages.
  3. Among Salaried and Small Business customers, close to 20% bought travel packages.

Gender VS ProdTaken

Observations:

Male customers outnumber Female customers; however, the percentage purchasing the product differs little between genders.

NumberOfPersonVisiting VS ProdTaken

Observations:

  1. Among customers who plan to take 2-4 persons with them during travel, close to 20% bought a travel package.
  2. No customer with one companion or five companions purchased any product.
  3. This suggests the products do not appeal to, or benefit, these two groups.
  4. Business should focus on this area.

ProductPitched VS ProdTaken

Observations:

  1. The Basic package is the most preferred.
  2. Standard and Deluxe follow.
  3. Very few customers purchased the Super Deluxe product.

PreferredPropertyStar VS ProdTaken

Observations:

  1. Though the majority of customers prefer a 3.0-star rated property, the percentage of them purchasing products is lower than among customers who prefer a 4.0- or 5.0-star rated property.
  2. The higher the property star rating, the higher the proportion of customers who purchased a product.

MaritalStatus VS ProdTaken

Observations:

  1. Around 30% of Single customers bought a product, and about 25% of Unmarried customers also purchased one.
  2. Almost 50% of customers are married, but only approximately 15% of them actually purchased a product.

Passport VS ProdTaken

Observations:

Customers with a passport are more likely to purchase products than those without.

PitchSatisfactionScore VS ProdTaken

Observations:

  1. The majority of customers gave the sales pitch a score of 3.0.
  2. However, the number of customers who purchased a product is almost equal across all pitch scores.

OwnCar VS ProdTaken

Observations:

There is hardly any difference in purchase percentage between customers with and without cars.

NumberOfChildrenVisiting VS ProdTaken

Observations:

The percentage of customers who purchased a product is roughly the same across all categories of NumberOfChildrenVisiting.

Designation VS ProdTaken

Observations:

  1. Around 30% of customers with the Executive designation purchased a product.
  2. 16% of Senior Manager and 11% of Manager customers purchased a product.
  3. Very few customers with VP and AVP designations purchased a product.

Correlation Matrix

Observations:

  1. The correlation values between all variables are quite low.
  2. Only Age and DurationOfPitch have a very low negative correlation.
  3. MonthlyIncome and Age have the highest positive correlation at 0.47; as Age increases, so does MonthlyIncome.
  4. NumberOfFollowups and NumberOfTrips have a moderate positive correlation with each other and, individually, with MonthlyIncome.

Outliers Treatment

Let's check the percentage of outliers using the IQR method
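A minimal sketch of the IQR check; the helper name and toy series are hypothetical:

```python
import pandas as pd

def outlier_percentage(series: pd.Series) -> float:
    """Percentage of values outside the 1.5 * IQR whiskers."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    mask = (series < lower) | (series > upper)
    return 100 * mask.mean()

# Hypothetical toy series with one obvious outlier (120)
s = pd.Series([10, 12, 11, 13, 12, 11, 120])
print(round(outlier_percentage(s), 1))  # 1 outlier out of 7 values
```

Applying this per numeric column gives the outlier percentages discussed in the observations below.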

Observations:

  1. MonthlyIncome and NumberOfFollowups have a high percentage of outliers.
  2. DurationOfPitch and NumberOfTrips also have a few outliers.
  3. Since we are building decision-tree-based models, which are not influenced by outliers, we can choose not to treat them.

Model Building

Model Evaluation Criterion

Model can make two kinds of wrong predictions:

  1. Predicting that the customer will purchase a Travel Package when they don't - a False Positive.
  2. Predicting that the customer will not purchase a Travel Package when they do - a False Negative.

The Travel company"s objectives are:

  1. Make Marketing Expenditure more efficient and focused on the customers that would actually purchase the product.
  2. Predict and Identify all potential customers who will purchase the newly introduced travel package.

Metric for Optimization:

For the above objectives, it is important that both False Positives and False Negatives are low. Hence we want to maximize the F1-Score: the greater the F1-Score, the greater the chance of predicting both classes correctly.
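The trade-off can be illustrated on toy predictions: precision is penalized by False Positives, recall by False Negatives, and F1 is their harmonic mean. The labels below are hypothetical:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy labels and predictions to illustrate how F1 balances the two error types
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]  # one false positive, one false negative

p = precision_score(y_true, y_pred)   # penalized by false positives
r = recall_score(y_true, y_pred)      # penalized by false negatives
f1 = f1_score(y_true, y_pred)         # harmonic mean of precision and recall
print(p, r, f1)  # 0.75 0.75 0.75
```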

We will build following models, tune them and compare the outcome of all the models:

  1. Decision Tree model.
  2. Bagging Classifier.
  3. Random Forest Classifier.
  4. ADABoost.
  5. GradientBoost.
  6. XG Boost.
  7. Stacking Classifier.

Split the data in Train and Test Sets

Customer interaction data is not relevant for our analysis, so we will drop it. The CustomerID column is also not required.

Create dummy variables for the categorical columns

Split the data in Training and Testing Sets
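The encoding and split can be sketched as below; the miniature DataFrame is a hypothetical stand-in for the prepared dataset:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical miniature of the prepared dataset
df = pd.DataFrame({
    "Age": [28, 34, 41, 52, 30, 45, 38, 29],
    "Gender": ["Male", "Female"] * 4,
    "ProdTaken": [1, 0, 0, 0, 1, 0, 0, 1],
})

# One-hot encode categoricals, dropping the first level to avoid redundancy
X = pd.get_dummies(df.drop(columns="ProdTaken"), drop_first=True)
y = df["ProdTaken"]

# Stratify so train and test sets keep the same class imbalance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y
)
print(X_train.shape, X_test.shape)
```

Stratification matters here because ProdTaken is heavily imbalanced; a plain random split could leave the test set with too few buyers.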

Check the split of target variable ProdTaken

Decision Tree Classifier

Check the scores

Visualise the tree

Draw the confusion matrix

Check the important variables

Observations:

  1. The model tends to overfit the training set.
  2. The F1 score on the test set is 0.60.
  3. Age and MonthlyIncome are the most important variables.
  4. The tree is difficult to read and understand when drawn.
  5. Since the model overfits the training set, we will use GridSearchCV to find the optimal parameter values and hyperparameter-tune the Decision Tree Classifier.

Hypertuned Decision Tree Classifier
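A minimal sketch of the tuning step. The synthetic data and the small parameter grid are illustrative assumptions, not the notebook's actual grid:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the prepared training data
X, y = make_classification(n_samples=300, n_features=8, weights=[0.8], random_state=7)

# Small, illustrative grid; the notebook's actual grid may differ
param_grid = {
    "max_depth": [3, 5, 7],
    "min_samples_leaf": [5, 10],
}
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=7),
    param_grid,
    scoring="f1",   # optimize the chosen metric
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)
```

Scoring on F1 rather than accuracy keeps the search honest on imbalanced data, and constraining depth and leaf size is what curbs the overfitting seen in the untuned tree.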

Visualise the Tree

Check the scores

Draw the confusion matrix

Check the important variables

Observations:

  1. The F1 score has decreased to 0.51 on the training set and 0.49 on the test set.
  2. The F1 scores are close and comparable for the tuned Decision Tree.
  3. Passport_1 and Designation_Executive are new important variables considered by the tuned Decision Tree.
  4. The tree is readable when drawn.

Bagging Classifier
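A minimal sketch of a bagging run on synthetic, imbalanced data (an illustrative stand-in, not the notebook's exact configuration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the prepared data
X, y = make_classification(n_samples=400, weights=[0.8], random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=3, stratify=y)

# The default base estimator is a decision tree; bagging fits many trees
# on bootstrap resamples and votes, which reduces variance
bag = BaggingClassifier(n_estimators=100, random_state=3)
bag.fit(X_tr, y_tr)
print(round(f1_score(y_te, bag.predict(X_te)), 3))
```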

Check the scores

Draw the confusion matrix

Observations:

  1. The model is overfitting.
  2. The Bagging Classifier has better accuracy, and the F1 score is also higher.
  3. However, the model identifies only 10.09% true positives out of the 13% positives present in the test set.

Bagging Classifier with Logistic_Regression

Check the scores

Draw the confusion matrix

Observations:

  1. Using Logistic Regression as the base estimator does not suit our analysis, as it reduces the F1 score to zero.
  2. The model is not able to identify any true positives.
  3. However, the model is not overfitting, and it gives comparable accuracy on the training and test sets.
  4. Let us try the Bagging Classifier with a DecisionTreeClassifier as the base estimator.

Bagging Classifier with Decision Tree

Check the scores

Draw the confusion matrix

Observations:

  1. The model with the weighted decision tree hasn't improved the metrics.
  2. It identifies even fewer true positives.

Hypertuned Bagging Classifier

Check the scores

Draw the confusion matrix

Observations:

  1. The training and test accuracy and F1 scores have increased after tuning compared to the previous models.
  2. The model is overfitting, as the difference between training and test scores is very high.
  3. The model identifies non-buyers well, as the False Positive count is low.

Random Forest Classifier

Check the scores

Draw the confusion matrix

Check the important variables

Observations:

  1. The Random Forest Classifier is also overfitting on the training set.
  2. The F1 score has also decreased.
  3. MonthlyIncome and Age are the most important variables.

Random Forest Classifier with weights

Check the scores

Draw the confusion matrix

Check the important variables

Observations:

  1. There is no improvement by adding weights to the Random Forest classifier.
  2. MonthlyIncome and Age are still the most important variables.

Hypertuned Random Forest Classifier
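A minimal sketch of the tuning and feature-importance check; the synthetic data and the small grid are illustrative assumptions, not the notebook's tuned values:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in for the prepared data
X, y = make_classification(n_samples=300, weights=[0.8], random_state=11)

# Illustrative grid; the notebook's actual grid may differ
param_grid = {"n_estimators": [50, 100], "max_depth": [4, 6]}
grid = GridSearchCV(
    RandomForestClassifier(random_state=11), param_grid, scoring="f1", cv=3
)
grid.fit(X, y)

rf = grid.best_estimator_
# Feature importances identify the most influential variables
top3 = sorted(zip(rf.feature_importances_, range(X.shape[1])), reverse=True)[:3]
print(grid.best_params_, top3)
```

On the real data this is where MonthlyIncome, Age, and Passport_1 surface as the top features.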

Check the scores

Draw the confusion matrix

Check the important variables

Observations:

  1. The metrics have dropped after hyperparameter tuning.
  2. The model is not overfitting.
  3. The F1 score has decreased, but it is comparable between the training and test sets.
  4. The most important features for this model are:
    • MonthlyIncome
    • Age
    • Passport_1 (customers with a passport)
  5. This model gives an 85% accuracy rate, which is quite good despite the imbalance in the data.

ADA Boost Classifier

Check the scores

Draw the confusion matrix

Observations:

  1. The metrics for the ADA Boost model are close and comparable for the training and test sets.
  2. The F1 score is too low.
  3. The model identifies 5.25% true positives out of the 13% positives in the test set.

Hypertuned ADA Boost Classifier

Check the scores

Draw the confusion matrix

Observations:

  1. The hypertuning has improved the scores on ADA Boost.
  2. Model is able to identify 8.59% of true positives.

Gradient Boost Classifier
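A minimal sketch of a gradient boosting run on synthetic, imbalanced data; hyperparameters shown are scikit-learn defaults spelled out for clarity, not the notebook's exact settings:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the prepared data
X, y = make_classification(n_samples=400, weights=[0.8], random_state=9)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=9, stratify=y)

# Each stage fits a shallow tree to the errors left by the previous stages
gb = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=3, random_state=9
)
gb.fit(X_tr, y_tr)
print(round(f1_score(y_te, gb.predict(X_te)), 3))
```

Unlike bagging, which averages independent trees to cut variance, boosting builds trees sequentially to cut bias, which is why its train/test gap tends to be smaller here.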

Check the scores

Draw the confusion matrix

Observations:

  1. The metrics are comparable between the training and test sets.
  2. The F1 score is better than that of the default ADA Boost.
  3. The model identifies 6.95% true positives out of the 13% positives in the test set.

Gradient Boost with ADA Boost Classifier

Check the scores

Draw the confusion matrix

Observations:

  1. F1 Score has reduced.
  2. The results are still comparable between training and testing sets.

Hypertuned Gradient Boost Classifier

Check the scores

Draw the confusion matrix

Observations:

  1. F1 Score has improved on both training and testing sets.
  2. Model is able to identify 7.84% of true positives.

XG Boost Classifier (Extreme Gradient Boost Classifier)

Check the scores

Draw the confusion matrix

Observations:

  1. The model is overfitting on the training set.
  2. The F1 score has improved over earlier models.
  3. The model identifies 10.70% true positives.

Hypertuned XG Boost Classifier

Check the scores

Draw the confusion matrix

Observations:

  1. Hyperparameter tuning has reduced the metrics.
  2. The model identifies 9.34% true positives.
  3. The model still tends to overfit.

Stacking Classifier

We can stack the Random Forest, Gradient Boost, and Decision Tree classifiers. These three models show the least overfitting and have good performance metrics.
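A minimal sketch of the stack on synthetic data; the base-learner hyperparameters and the Logistic Regression meta-learner are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the prepared data
X, y = make_classification(n_samples=300, weights=[0.8], random_state=5)

# Base learners named in the text; hyperparameters here are illustrative
estimators = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=5)),
    ("gb", GradientBoostingClassifier(random_state=5)),
    ("dt", DecisionTreeClassifier(max_depth=4, random_state=5)),
]
# A meta-learner combines the base learners' cross-validated predictions
stack = StackingClassifier(
    estimators=estimators, final_estimator=LogisticRegression(), cv=3
)
stack.fit(X, y)
print(round(stack.score(X, y), 3))
```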

Check the scores

Draw the confusion matrix

Observations:

  1. The F1 score is reduced to 0.54.
  2. The model is not overfitting.
  3. The accuracy of the model is 0.85.

Compare all models

Observations:

  1. Bagging with a Decision Tree has the highest F1 score, but it is overfitting the data.
  2. Despite its lower F1 score, the Hypertuned Random Forest has more generalized metrics and does not seem to overfit the data, making it best suited for future analysis.
  3. Most of the models show comparable scores between the training and test sets.

Conclusion

  1. A key missing variable is whether the product pitched was the same product that was bought.
  2. Basic and Deluxe are the most popular packages.
  3. There is imbalance in the data, as only 18% of customers bought any product. This must be addressed in future analysis.
  4. NumberOfChildrenVisiting and NumberOfPersonVisiting do not seem to impact model performance much.
  5. The company can run the model on new data to achieve the desired performance levels and to offer better packages to customers.
  6. Young and single people are more likely to buy the offered packages.
  7. Age and Income are correlated, and higher age groups and higher monthly income groups lean towards the more expensive packages.

Recommendations

  1. The marketing team can curate individual packages for specific business designations.
  2. The marketing team can create product- and customer-segment-specific sales pitches to reduce the DurationOfPitch.
  3. The WELLNESS TOURISM PACKAGE should be curated considering the features of existing packages that customers have purchased.
  4. The company can run campaigns and offers for customers with families to increase sales.
  5. The data shows customers with a passport have a higher purchase rate; the business can curate international packages for such customers.
  6. Specific packages can be created for different income groups.
  7. The data collection process can be enhanced to capture additional information, such as customer satisfaction after the tour and data correlating the product pitched with the product actually purchased.